Order independent structural alignment of circularly permuted proteins 
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Abstract-Circular permutation connects the N and C ter- 
mini of a protein and concurrently cleaves elsewhere in the 
chain, providing an important mechanism for generating novel 
protein fold and functions. However, their in genomes is un- 
known because current detection methods can miss many oc- 
curances, mistaking random repeats as circular permutation. 
Here we develop a method for detecting circularly permuted 
proteins from structural comparison. Sequence order indepen- 
dent alignment of protein structures can be regarded as a special 
case of the maximum-weight independent set problem, which is 
known to be computationally hard. We develop an efficient ap- 
proximation algorithm by repeatedly solving relaxations of an 
appropriate intermediate integer programming formulation, we 
show that the approximation ratio is much better then the theo- 
retical worst case ratio of r — 1/4. Circularly permuted proteins 
reported in literature can be identified rapidly with our method, 
while they escape the detection by publicly available servers for 
structural alignment. 

Keywords-circular permuations, integer programming, lin- 
ear programming, protein structure alignment 



Introduction 

A circularly permuted protein arises from ligation of the N 
and C termini of a protein and concurrent cleavage elsewhere 
in the chain [1,2]. In nature, circular permutation often orig- 
inate from tandem repeats via duplication of the C-terminal 
of one repeat together with the N-terminal of the next re- 
peat, as is the case of swaposin. Another mechanism is liga- 
tion and cleavage of peptide chains during post-translational 
modification, as is the case of concanavalin A. The full ex- 
tent of circular permutation and thier biological consequences 
are currently unknown. Discovery of circular permutation at 
genome wide scale will enable systematic studies of its con- 
tribution to the generation of novel protein function and novel 
protein fold. 

Currently, sequence alignment based methods are the 
only available tools for detecting circular permutations. 
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These include postprocessing output from standard dynamic 
programming methods, as well as customized algorithms [3]. 
Sequence based methods can miss many circularly permuted 
proteins, because either one or both fragments may escape 
detection by local alignment if the two proteins are distantly 
related. 

Protein structures are far more conserved than sequences, 
and they can reveal very distant evolutionary relationship [4]. 
Jung et. al. [5] showed that there is potential to discover re- 
motely related protein domains by artifically permutating pro- 
teins and superimposing them on native domains in the pro- 
tein structure database. Detecting circular permutation from 
structures has the promise to uncover many more ancient per- 
mutation events that escape sequence methods. 

In this study, we describe a new algorithm for detecting 
circular permutation in protein structures. We show with ex- 
amples that our algorithm can align circular permutations re- 
ported in literature. Our work introduces an efficient approx- 
imate method for protein substructure comparison. Two pro- 
tein structures are first cut into pieces exhaustively of vary- 
ing lengths and then compared. An approximation algorithm 
is used to search for optimal combination of peptide pieces 
from both structures. With the special nature of spatial dis- 
tribution of proteins, a fractional version of local-ratio ap- 
proach for scheduling split-interval graphs works well for de- 
tecting very similar spatial substructures. Our experimenta- 
tion showed that the approximation ratio is excellent, with an 
average value of at least 1 /2.53. 



Structural Alignment of Circularly Permuted 
Proteins 

Three simplified protein structures that are related by circular 
permutation are shown in Figure 1 . The three corresponding 
domains (labeled A, B, and C, respectively) are all very sim- 
ilar across proteins. The global spatial arrangements of these 
three domains are also very similar. However, the orderings 
of these domains in primary sequence are completely differ- 
ent. 

The discontinuity and different ordering of spatially 
neighboring domains in primary sequence makes the detec- 
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Table I: The LP formulation of the BSSI problem. 



Figure 1: Three cartoon protein structures related by a circular per- 
mutation. The location of domains A,B,C are shown in a primary se- 
quence layout below each structure. 



tion of global structural similarity. Most structural alignment 
methods are unable to connect the break where circular per- 
mutation occurs, and would prematurely terminates the align- 
ment. In addition, the reverse ordering of residues within a 
domain between two proteins also poses challenge for such 
methods. For example, the A domain in Figure la is ordered 
from N to C. The same domain in Figure lb is ordered from 
C to N. The algorithm outlined below is well-suited to solve 
these types of challenging problems. 

Basic Methodology We first introduce some notations: A 
substructure A a of a protein structure S a is a continuous frag- 
ment A a = (ai, (12) E A a , where a\, 0,2 are the residue num- 
bers of the beginning and end of the substructure, respec- 
tively, and 1 < a\ < 0,2 < |«S a |. A residue t G A a is part 
of substructure A a if a\ < t < 0,2- A a is the set of all possi- 
ble continuous substructures or fragments of protein structure 
S a , i.e., A a = {(ai, a 2 )}. S b , A 6 , 61, 6 2 , and A fc are similarly 
defined. 

The problem of protein substructure comparison is cap- 
tured in the following Basic Substructure Similarity Identifi- 
cation (BSSIa,<j) problem. 

Definition 1 Instance: a set A C A a X A5 of ordered pairs of substructures 
of Si and £2 and a similarity function a : A — > mapping these 
pairs of substructures to similarity values (non-negative real numbers). 

Valid solutions: a set {Ai, A2, • • ■ , A^-} = 

{(A a ,i, A m)> ( a <i,2, A 6i2 ), ■ ■ ■ , (A a ,fc, A b , fe )} of pairs of sub- 
structures such that \g i S Af and \g j £ A^ are disjoint for 
I S {a, b} and i ^ j. 

Objective: maximize the total similarity of the selection 

The BSSIa.ct problem is a special case of the famous 
maximum-weight independent set (MWIS) problem in graph 
theory [6]. BSSIa j(T is MAX-SNP-hard even when the sub- 
structures are restricted to short lengths [7] 1 . Our approach 
is to adopt the approximation algorithm for scheduling split- 
interval graphs [7], which is based on a fractional version of 
the local-ratio approach. We introduce the following defini- 
tions: 

• The conflict graph Ga = (Va, Ea) for any subset A C A, is 
a graph in which Va = A and Ea consists of all distinct pairs 
of vertices {(A a , Xb), (X' a , X' b )} from Va such that Ai and X[ 
are not disjoint for some i g {a, b}. 
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The closed neighborhood of a vertex v of Ga, denoted by 
NbrA [v], is defined as {u \ {u, v} £ Ea} U {v}. 

The recursive algorithm for solving BSSIa,<j is as fol- 



lows: 



Remove every substructure pairs X = (X a , Xt) £ A such that 
u(X a , Xb) < 0. If A = after these removals, then return as 
the solution. 

Solve a linear programming (LP) formulation of the BSSIa. ct 
problem by relaxing a corresponding integer programming 
version of the BSSIa, ct problem. For every A = (A a , Af,) G A, 
introduce three indicator variables x\, yx.x a and yx.x b £ 
{0, 1}, but relaxed to real numbers. The LP formulation is 
shown in Table I. 

For every vertex A g A of Ga, compute its local conflict 
number qj ( a,„ = S x M . Let A m i n be a vertex with 
MeNbr A [A] 

the minimum local conflict number rA, CT = rninA{aA,A,o-}. 
Define a new similarity function <j 11cw from a as follows 
<t(A), if v g Nbr A [A min ] 

a(A) — <r(A m in), otherwise 



1 A maximization problem being MAX-SNP-hard implies that there exists 
a constant < e < 1 such that no polynomial time algorithm can return a 
solution with a value of at least (1 — e) times the optimum value. 



• Recursively solve the BSSlA,<r now problem using the same ap- 
proach. Let A' C A be the solution returned. 

• If the substructure pair A m i n can be selected together with all 
the pairs in A' then return A' U {A m i n } as the solution, else 
return A' as the solution. 

A brief explanation of the various inequalities in the LP 
formulation as described above is as follows: 

• The interval clique inequalities in (1) (resp. (2)) ensures that 
the various substructures of S a (resp. St) in the selected pairs 
of substructures from A are mutually disjoint. 

• Inequalities in (3) and (4) ensure consistencies between the in- 
dicator variable for substructure pair A and its two substruc- 
tures A a and A;,. 

• Inequalities in (5) ensure non-negativity of the indicator vari- 
ables. 




Figure 2: A simplified representation of the Basic Substructure Sim- 
ilarity Identification problem: a) The cartoon representation of circu- 
larly permuted proteins S a and S b , b) The problem represented as a 
graph where each node, A, represents an aligned fragment pair and 
each edge represents a conflict, where the same residues are con- 
tained in different fragments and c) An illustration how sweep lines 
(dashed) can identify overlapping aligned pairs. 

In implementation, the graph Ga is considered implic- 
itly via intersecting intervals. The interval clique inequali- 
ties can be generated via a sweepline approach. The running 
time depends on the number of recursive calls and the times 
needed to solve the LP formulations. Let LP(n, m) denote the 
time taken to solve a linear programming problem on n vari- 
ables and m inequalities. Then the worst-case running time 
of the above algorithm is 0(|A|-LP(3|A|,5|A| + |Ai| + |A 2 |)). 
However, the worst-case time complexity happens under the 
excessive pessimistic assumption that each recursive call re- 
moves exactly one vertex of Ga, namely A m i n only, from 
consideration, which is unlikely to occur in practice as our 
computational results show. 

The performance ratio of our algorithm is as follows. Let 
r be the maximum of all the ta.o-'s in all the recursive calls. 
Proofs in [7] translate to the fact that r < 4 and the above 
algorithm returns a solution whose total similarity is at least 
- times that of the optimum. The value of r is much smaller 
compared to 4 in practice {e.g., r = 2.53). 

Implementation and Computational Details. We present a 
simplified example for illustration two protein structures S a 
(Figure 2a) and Sb (Figure 2a) are selected for alignment. 
Here Sb is the structure to be aligned to the reference structure 
S a . We systematically cut Sb into fragments of length 7-25 
and exhaustively compute a similarity score of each fragment 
from Sb to all possible fragments of equal length in S a - Each 
fragment pair can be thought of as a vertex in a graph, as 
shown in Figure 2b. 

Suppose we have the following similarity scores for 
aligned substructures: o-(Ai) = o-((l,3)(l',3')) = 8,o-(A 2 ) = 

<7((4,5)(4',5')) = 5,<t(A 3 ) = <t((4,7)(4',7')) = 7,<t(A 4 ) = 
<t((6,8)(6',8')) = 3, andcr(A 5 ) = ct((9, 10)(9',10')) = 6. 

We can describe the problem of selecting the best struc- 
tural fragment pairs as to maximize 8x\ 1 + 5x\ 2 + 7x\ 3 + 
3x Xi + 6xx 5 . 



Table II: The constraints of the consflict graph for the set of frag- 
ments in Figure 2c. 



Figure 2b shows the conflict graph for the set of frag- 
ments. A sweep line (shown as dashed lines in Figure 2c) is 
implicitly constructed (0(n) time after sorting) to determine 
which vertices of fragment pair overlap. A conflict is shown 
in Figure 2b as edges between vertices. Vertices Ai and A5 
do not conflict with any other fragments, while A2 and A4 
conflict with A3 . For this graph, the constraints in the linear 
programming formulation are shown in Table II. The linear 
programming problem is solved using the BPMPD package 
[8]. 

The algorithm guarantees that there is a vertex Ai that 
satisfies x Xi + X)a eNbr[A ] - 4 ' an ^ a ^ vertices are 
searched to find such a vertex. We then update a(Xj) for 
vertices that are neighbors of A^: o-new(Aj) = o-(Xj) - 
a-(Xi), if Xj g Nbr[Ai] Vertices that have no neighbors, such 
as Ai and A5 in our example, are included in our final solu- 
tion. Using the updated values of cr(Ai), we remove vertices 
of substructure pairs that have score less than 0, and recur- 
sively solve the (smaller) LP problem again. The selected 
final set of fragments are then used to construct a substruc- 
ture of Sb- They are then aligned structurally by minimizing 
cRMSD. 

Similarity score. The similarity score er(A) = er(A a ,Ab) 
between two aligned substructures A a and Ab is a 
weighted sum of a shape similarity value derived from 
cRMSD value and a sequence composition score (SCS): 

SCS = J2iLi B{A a .i, A b , i), where A a ^ and A b ,ii are the 
amino acid residue type at aligned position i, N the number 
of residues in the aligned fragment, and B(A a> i, Ab t i) the 
similarity score between A a ^ and Ab : i by the modified 
BLOSUM50 matrix, in which a constant is added to all 
entries such that the smallest entry is 1.0. The similarity 
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Table III: Results from the structural alignment of circularly per- 
muted protein, where N equal the number of aligned residues in the 
final solution. Human-made circularly permuted proteins are listed at 
the bottom separated by a line from naturally occurring proteins. 



PDB/Size PDB/Size 


Frag- N 
ments 


cRMSD 


lrin/180 2cna/237 
lrsy/121 lqas/123 
lnkl/78 lqdm/74 
lonr/316 lfba/360 
laqi/191 lboo/259 


3 45 

4 44 

6 48 

7 77 
4 66 


0.877 
1.107 
1.832 
2.444 
3.571 


lavd/123 lswg/112 
lgbg/214 lajk/212 


6 66 
5 110 


0.815 
0.347 




Figure 3: The structure of concanavalin A (2cna) (a) and its super- 
position to lectin from garden pea (lrin) (shown in gray) (b). The 
residues spanning the break due to the circular permutation are high- 
lighted in red. 



score <x(A , A&) is calculated as follows: 

, A , , a(C — cRMSD) + /3(SCS) 
a{\ a ,Ab) = — , 

For the current implementation, C — 60, a — 6, and (3=1. 

To improve computational efficiency, a threshold mea- 
sure is introduced to immediately exclude low scoring frag- 
ment pairs from further consideration. Only fragment pairs 
scoring above the threshold will be candidates to be included 
in the final solution. In the above example, we assume that 
only fragment pairs represented by vertices Ai, A2, A 3 , A4, 
and A 5 (Figure 2b) are above the threshold. 



Results 

We discuss the structural alignments of several well known 
examples of naturally occurring circularly permuted proteins. 

The results including other examples are summarized in 
Table III, which includes two additional examples of experi- 
mentally constructed human-made permuted proteins. None 
of them are found by existing servers. 

The first naturally occurring circular permutation was 
found in concanavalin A. and lectin favin. The structures of 
lectin from garden pea (lrin) and concanavalin A (2cna) 
(Figure 3a,b). In the structural alignment, three fragments 
align over 45 residues with a root mean square distance of 
0.82A. The superposition based on the aligned fragments is 
shown in Figure 3b. The residues spanning a break in the 
backbone due to the permutation are highlighted in red. 

Discussion 

The approximation algorithm introduced in this work can find 
good solutions for the problem of detecting circular permuted 
proteins. In contrast to methods based on sequence align- 
ment alone, the ability to incorporate both structural and se- 
quence similarity is critical. An experimentation in the align- 
ment of lrin and 2cna illustrates this point. After turning 
off the contribution from cRMSD in the similarity function 



c(A a , Ab) by setting a = 0, we find altogether 452 frag- 
ments that score in the top 10% of the set of all SCS scores. 
This number is too large as the size of the conflict graph will 
be large. This makes subsequent comparison very difficult. 
Similarly, cRMSD measurement alone is inadequate for com- 
parison. When matching fragments from lrin and 2cna, 
we find there are 287 fragments that can be aligned with an 
cRMSD value < 3.0 Aafter setting [3 = 0. The use of com- 
posite similarity function is therefore essential to reduce false 
positives in substructure alignments. 

In our method, scoring function plays pivotal role in de- 
tecting substructure similarity of proteins. Much experimen- 
tations are needed to optimize the similarity scoring system 
for general sequence order independent structural alignment. 
A necessary ingredient for a fully automated search of cir- 
cular permuted proteins in database is a statistical measure- 
ment of significance of matched substructures. Since cRMSD 
measurement and therefore er(A a , A&) strongly depends on the 
length of matched parts and the gaps of the sequence break, 
a statistical model assessing the p-value of a measured <r(A) 
against a null model needs to be developed. Circular permu- 
tations that are likely to be biologically significant can then 
be automatically declared [4]. 
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